ABSTRACT

In the compressed sensing (CS) framework, a signal is sampled below the Nyquist rate, and the acquired compressed samples are generally random in nature. However, for efficient estimation of the actual signal, the sensing matrix must preserve the relative distances among the acquired compressed samples. Provided this condition is fulfilled, we show that CS samples preserve the envelope of the actual signal even at different compression ratios. Exploiting this envelope-preserving property of CS samples, we propose a new fast dictionary learning (DL) algorithm that extracts prototype signals from compressive samples for efficient sparse representation and recovery of signals. These prototype signals are orthogonal intrinsic mode functions (IMFs) extracted using empirical mode decomposition (EMD), which is one of the popular methods to capture the envelope of a signal. The extracted IMFs are used to build the dictionary without knowledge of the original signal or the sensing matrix. Moreover, one can build the dictionary on-line as new CS samples become available. In particular, to recover the first $L$ signals ($\in \mathbb{R}^n$) at the decoder, one can build the dictionary in just $O(nL \log n)$ operations, which is far fewer than existing approaches require. The efficiency of the proposed approach is demonstrated experimentally for the recovery of speech signals.

Index Terms — Speech Processing, Compressed Sensing, 
dictionary learning, empirical mode decomposition. 

1. INTRODUCTION 

Compressed sensing (CS) and sparse signal representations have recently drawn much interest in the field of speech processing, e.g., speech encryption [?] and speech recognition [?]. In particular, CS enables us to reconstruct a signal $x \in \mathbb{R}^n$ that can be sparsely represented in an overcomplete dictionary $\Psi \in \mathbb{R}^{n \times d}$ ($d = n$ for a complete dictionary), via recovery of its sparse representation $a \in \mathbb{R}^d$ from very few measurements $y \in \mathbb{R}^m$ sampled using a measurement matrix $\Phi \in \mathbb{R}^{m \times n}$ with $m \ll n$ [?], [?]. CS measurements are robust to degradations such as random perturbations or noise and do not require much memory for storage or transmission [?]. In CS, although the signal acquisition is random, the obtained linear projections or measurements still preserve the relative distance between two signal points [?]. This is supported by our observation that the compressive samples indeed preserve the envelope of the actual signal. It is known that, for speech signals, the signal envelope is very important in perception, e.g., words are identified according to their envelope [?]. Thus, this paper focuses on speech signals.

Exploiting the envelope-preserving property of CS measurements, we propose a novel method that aims to express a speech signal as a sparse linear combination of prototype signals extracted directly from compressive speech samples. These prototype signals can be intrinsic mode functions (IMFs) extracted using empirical mode decomposition (EMD), which is one of the popular methods to capture the envelope of a signal. We show that the IMFs extracted from compressive speech behave similarly to the ones extracted from the speech signal directly. Hence, the extracted IMFs can be used to build the dictionary, with which one can recover the original speech signal from CS samples.

1.1. Related Works 

The estimation of the sparse vector (or equivalently the original signal) from compressive samples is strongly influenced by the choice of dictionary [?]. It has been shown that a sparse representation estimated using a learned dictionary, as compared to an analytic dictionary (e.g., DCT), results in better recovery of the signal [4]. Thus, the DL problem aims to find a dictionary $\Psi$ such that the error $\|x_i - \Psi a_i\|_2^2 \ \forall i$ is minimized and $a_i$ is sparsest [?]. Typically this is achieved by alternating minimization over the $a_i$'s and $\Psi$, i.e., the optimization is carried out over one while keeping the other fixed [?]. Details of various dictionary learning algorithms can be found in [?]. Provided the dictionary is available, one can efficiently recover a speech signal from compressive speech samples via recovery of its sparse representation [?]. For instance, the approaches in [?] and [?] recover a speech signal using a dictionary built from pre-estimated vocal tract filter coefficients or a line spectral frequency (LSF) codebook derived from the training data. However, when only compressive samples are available, recovering the actual signal while simultaneously learning a dictionary is a difficult task. To address this issue, recent works have proposed modified DL methods (e.g., partial-KSVD [?]) where the dictionary is learned from CS samples by minimizing the objective function $\|y_i - \Phi\Psi a_i\|_2^2 \ \forall i$. However, such DL methods are computationally expensive and assume that the signal support set (the non-zero index locations of the sparse vector) is known a priori. Alternatively, one can use recovery-based DL methods, which are mathematically tractable compared to conventional methods [?]. Here, starting from an initial dictionary, the current estimate of the signal recovered from compressive samples is used to update the dictionary, and this procedure is repeated until convergence. Recovery-based DL methods are essentially based on the concepts of blind compressed sensing [?]. One such iterative DL approach for speech signals is presented in [?].

Nevertheless, applying CS to speech signals involves two main issues: (1) for speech signals (which have many variations due to speaker, speaking style, or spoken language), the dictionary should preferably be trained on speaker-specific training data, which might not be available in every scenario and requires a huge amount of storage; (2) existing recovery-based or conventional DL algorithms have large computational complexity.

1.2. Contributions of the Proposed Work 

In this paper, we propose a novel, fast, unsupervised DL approach for the recovery of compressive speech signals. In this work, we are interested in the scenario where only compressed measurements of the actual speech signal are available, without any prior knowledge of the signal's support set. We show that it is indeed possible to learn a dictionary from compressive speech samples while bypassing the reconstruction of the actual speech signal, i.e., eliminating the considerable cost of recovering irrelevant data. To this end, the dictionary is built using IMFs extracted directly from CS samples, without knowledge of the original speech signal or the sensing matrix used to acquire the signal. Moreover, because the extracted IMFs are orthogonal, the resulting dictionary has good mutual coherence properties. It is worth emphasizing that the goal of the paper is not to outperform a state-of-the-art CS recovery method but to propose an approach that can perform with an acceptable level of accuracy in heavily resource-constrained environments, both in terms of storage and computation. To the best of our knowledge, no previous work has proposed such a method for compressively sensed signals.

The rest of the paper is organized as follows. In Section 2, we briefly explain the modeling of speech signals using the CS framework and how the envelope of a speech signal is preserved in compressive samples. In Section 3, we propose an efficient DL algorithm for compressive speech signals using EMD, and the experimental results are presented in Section 4. A summary of the paper is given in Section 5.

2. MODELING SPEECH SIGNALS USING CS 

In the CS framework, signals are sampled at less than the Nyquist rate [8]. In particular, given a matrix $Y \in \mathbb{R}^{m \times L}$ consisting of $L$ compressive speech signal frames $\{y_i\}_{i=1}^{L}$ as columns, the recovery of the corresponding signal set ($X \in \mathbb{R}^{n \times L}$) is formulated as:

$$X \approx \Psi A, \quad \text{where } A \text{ is computed as}$$

$$A = \arg\min_{A} f(A) \;:\; \|Y - \Phi\Psi A\|_F^2 = \|Y - DA\|_F^2 \le \epsilon, \qquad (1)$$

where $A \in \mathbb{R}^{d \times L}$ is the sparse coefficient matrix corresponding to $X$, $\epsilon$ is the error tolerance, $f(\cdot)$ is a function (e.g., the $\ell_1$-norm) that promotes sparsity, and $D = \Phi\Psi \in \mathbb{R}^{m \times d}$ is the overall dictionary. According to CS theory, if the matrix $\Phi$ satisfies the restricted isometry property (RIP) and is incoherent with the dictionary $\Psi$, the signal can be recovered with very high probability by linear programming methods [?].

2.1. Randomness Does Make Sense: Properties of Compressive Samples

CS acquires random signal measurements¹ and hence does not preserve any signal structure in their raw form. However, these linear projections, acquired using a $\Phi$ that satisfies the RIP, still preserve the relative distance between two signal points or vectors [?], i.e.,

$$\|\Phi(x_1 - x_2)\|_2^2 \approx \|x_1 - x_2\|_2^2 \quad \forall\, x_1, x_2 \in \mathbb{R}^n. \qquad (2)$$

Moreover, the mean of the measured energy is exactly equal to $\|x\|_2^2$, i.e., $E[\|\Phi x\|_2^2] = \|x\|_2^2$. To illustrate this, Fig. 1 shows an example of the original and compressively sensed speech signal. Note that the sampling rate of the compressive speech is lower than that of the original speech signal, and for a fair comparison, the interpolated compressive speech, computed using cosine interpolation, is plotted in the figure. It can be observed that although the measurement vector exhibits some random, noise-like nature, the envelopes of the original and the compressive speech signals are approximately similar, even at different compression ratios. In other words, the preserved structure of the instance space, i.e., the speech signal, is more prominent if viewed globally or over longer windows.
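As a rough numerical illustration of Eq. (2) and of the energy identity above, the following sketch draws Gaussian sensing matrices and compares the energy of a difference vector before and after projection. It assumes i.i.d. entries with variance 1/m (a common normalization that makes the expected energy match; the scaling is an assumption here), and the helper names and toy dimensions are hypothetical.

```python
import math
import random

def gaussian_sensing_matrix(m, n, rng):
    """m x n matrix with i.i.d. N(0, 1/m) entries (assumed normalization)."""
    s = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(m)]

def measure(phi, x):
    """Compressive measurement y = Phi x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(c * c for c in v)

def energy_ratios(n=64, m=32, trials=200, seed=0):
    """Ratios ||Phi (x1 - x2)||^2 / ||x1 - x2||^2 over fresh random draws."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        phi = gaussian_sensing_matrix(m, n, rng)
        x1 = [rng.uniform(-1, 1) for _ in range(n)]
        x2 = [rng.uniform(-1, 1) for _ in range(n)]
        diff = [a - b for a, b in zip(x1, x2)]
        ratios.append(sq_norm(measure(phi, diff)) / sq_norm(diff))
    return ratios

ratios = energy_ratios()
mean_ratio = sum(ratios) / len(ratios)  # should be close to 1 on average
```

Individual ratios fluctuate around one, while their average over many draws settles near one, which is the sense in which the projection preserves distances in expectation.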

To exploit this preserved envelope, one may decompose a compressive speech signal to extract prototype signals with which to build the dictionary. One way to achieve this is to apply EMD to the compressive speech signal. EMD exploits the signal envelope, or the evolution of a signal between two consecutive local extrema, to decompose a signal into orthogonal modes or IMFs, which can be used as dictionary atoms.

¹The elements of the sensing matrix are assumed to be i.i.d. random variables.
Fig. 1. Comparison of the envelopes (manually marked in red) of (a) the original speech signal, and (b), (c) the interpolated compressive speech signal originally sampled at compression ratios (m/n) of 0.7 and 0.5, respectively.

Fig. 2. EMD decomposition of a voiced frame of compressive speech sampled at a compression ratio (m/n) of 0.5.

Fig. 3. EMD decomposition of a voiced frame of (a) the original speech signal and (b) the interpolated compressive speech signal.


3. CS-EMD: A FAST DICTIONARY LEARNING 
APPROACH FOR COMPRESSIVELY SENSED 
SPEECH SIGNALS 

The proposed approach is an exemplar-based approach in which a speech frame is sparsely represented as a linear combination of a few IMFs from the dictionary, selected optimally using sparsity constraints. However, the IMFs used to build the dictionary are extracted directly from CS samples. Using the EMD method, a given compressive speech frame $y$ can be expressed as

$$y = \sum_{q=1}^{J} m_q + r, \qquad (3)$$

i.e., a sum of $J$ orthogonal modes $m_q \in \mathbb{R}^m$ and a residual trend $r \in \mathbb{R}^m$ [14]. In order to achieve an efficient decomposition, our approach uses a modified EMD algorithm called the Ensemble Empirical Mode Decomposition (EEMD), as proposed in [15]. Figs. 2 and 3(a) show an example of a compressive and the corresponding original voiced speech frame, respectively, along with the first 5 extracted IMFs. One can observe that most of the IMFs extracted from compressive samples (Fig. 2) behave similarly to the IMFs extracted from raw speech samples (Fig. 3(a)). Thus, one can use these IMFs directly to build the dictionary. Because the extracted IMFs are orthogonal, the resulting dictionary has good coherence bounds. Further, the biggest advantage of the proposed approach is its time complexity, which follows from the fact that extracting IMFs and building the dictionary does not require the sensing matrix to be known. However, there are still two major issues in building the dictionary in order to recover the signal: (1) the dimensionality of the dictionary atoms, and (2) building a dictionary of appropriate size.
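The decomposition in Eq. (3) can be sketched with a deliberately simplified sifting loop. This is not the EEMD variant used in the paper: it assumes piecewise-linear envelopes (standard EMD uses cubic splines), a fixed number of sifting passes, and no noise ensemble, and every function name is hypothetical. It only illustrates that the frame equals the sum of the extracted modes plus the residual.

```python
import math

def local_extrema(x):
    """Indices of interior local maxima and minima."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i] < x[i - 1] and x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima

def envelope(x, idx):
    """Piecewise-linear envelope through x at idx, anchored at both ends."""
    knots = [0] + idx + [len(x) - 1]
    env = [0.0] * len(x)
    for a, b in zip(knots[:-1], knots[1:]):
        for i in range(a, b + 1):
            t = (i - a) / (b - a)
            env[i] = (1 - t) * x[a] + t * x[b]
    return env

def sift(x, n_sift=10):
    """Extract one mode by repeatedly removing the mean envelope."""
    h = list(x)
    for _ in range(n_sift):
        maxima, minima = local_extrema(h)
        if len(maxima) + len(minima) < 3:
            break
        upper = envelope(h, maxima)
        lower = envelope(h, minima)
        h = [v - 0.5 * (u + l) for v, u, l in zip(h, upper, lower)]
    return h

def emd(y, max_imfs=5):
    """Return (imfs, residual) such that y = sum(imfs) + residual."""
    imfs, residual = [], list(y)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residual)
        if len(maxima) + len(minima) < 3:
            break  # too few extrema left: treat the rest as the trend r
        mode = sift(residual)
        imfs.append(mode)
        residual = [r - m for r, m in zip(residual, mode)]
    return imfs, residual

# Toy two-tone frame standing in for a compressive speech frame.
y = [math.sin(2 * math.pi * 5 * t / 200) + 0.5 * math.sin(2 * math.pi * 40 * t / 200)
     for t in range(200)]
imfs, residual = emd(y)
```

The cap of five modes mirrors the number of IMFs shown in the figures; a production implementation would also use a proper stopping criterion for the sifting passes.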

3.1. Dimensionality of dictionary atoms 

In order to recover the original speech frame, the dimensionality of each dictionary atom should be equal to that of the speech frame (see Eq. (1)). However, any IMFs extracted from the CS measurement vector will have lower dimensionality. Further, low sampling rates also affect the performance of EMD. It has been shown that EMD can still be effective (within tolerable limits) if the signal is interpolated, for example by Fourier or cosine interpolation methods. Hence, we used the raised cosine EEMD method [16] (with roll-off factor $\beta = 1$) to extract IMFs of appropriate dimensions. As an illustration, the IMFs extracted using EEMD from the compressive speech frame considered in Fig. 2, after interpolation, are plotted in Fig. 3(b). It can be observed that the IMFs are now more structured, as in the case of the original speech signal, and can help in learning a better dictionary.
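To make the dimensionality step concrete, the sketch below stretches an m-point compressive frame onto n points using plain pairwise cosine interpolation between neighbouring samples. This is a stand-in chosen for simplicity, not the raised cosine EEMD method of [16], and the function name is hypothetical.

```python
import math

def cosine_interpolate(y, n):
    """Resample the m-point sequence y onto n points with cosine easing."""
    m = len(y)
    if m < 2 or n < 2:
        raise ValueError("need at least two input and two output samples")
    out = []
    for k in range(n):
        pos = k * (m - 1) / (n - 1)   # fractional index into y
        i = min(int(pos), m - 2)      # left neighbour
        mu = pos - i                  # position between neighbours, 0..1
        w = (1 - math.cos(mu * math.pi)) / 2
        out.append(y[i] * (1 - w) + y[i + 1] * w)
    return out

# A 200-sample frame (m/n = 0.5) stretched to the 400-sample frame length.
frame = [math.sin(0.3 * k) for k in range(200)]
stretched = cosine_interpolate(frame, 400)
```

Because each output sample is a convex combination of two neighbouring inputs, the interpolated frame stays within the range of the measurements, and IMFs extracted from it have the same length as the original speech frame.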

3.2. Dictionary size 

A speech signal is generally processed on a short-frame basis, and a dictionary built using all $J$ IMFs extracted from each compressive speech frame would be highly overcomplete. However, note that the IMF at each level of decomposition carries different scale and structural information. Hence, to restrict the atoms in the dictionary to a desired number, the IMFs extracted at each level $q = 1, \dots, J$ across all frames are clustered using the K-means algorithm. The cluster centers are then used as dictionary atoms, and the number of clusters depends on the number of atoms we wish the dictionary to have from each level. To obtain a sparser representation, more atoms should come from the initial levels, which contain more structures/patterns than the other levels. Algorithm 1 shows the pseudo-code of the proposed approach.
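The per-level clustering can be sketched as follows, using a plain Lloyd's K-means written from scratch. The initialization (random sampling of the data), the fixed iteration count, and all function names are assumptions made for illustration; the paper does not specify these details.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            groups[dists.index(min(dists))].append(v)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_dictionary(imfs_per_level, atoms_per_level):
    """imfs_per_level[q] holds the level-q IMFs of all frames and
    atoms_per_level[q] is K_q; returns the atoms (columns of Psi) as a list."""
    dictionary = []
    for level_imfs, k_q in zip(imfs_per_level, atoms_per_level):
        dictionary.extend(kmeans(level_imfs, k_q))
    return dictionary

# Toy data: 2 levels, 12 frames per level, 8 samples per IMF.
rng = random.Random(0)
levels = [[[rng.gauss(0, 1) for _ in range(8)] for _ in range(12)] for _ in range(2)]
atoms = build_dictionary(levels, [3, 2])  # d = 3 + 2 atoms in total
```

Giving the earlier levels a larger K_q, as the text suggests, only changes the second argument of `build_dictionary`.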

Note that apart from the presented approach, one is free to explore any variant of the EMD algorithm, any clustering approach, or some other optimal way to build the dictionary from the extracted IMFs. Also, apart from batch processing of all compressive frames, the dictionary can be learned on-line, where the dictionary atoms are updated as soon as a new frame is available for processing.

3.3. Computational Complexity 

The time complexity of EMD for extracting all possible IMFs from $L$ $n$-dimensional signal frames scales approximately as $O(nL \log n)$, which is equal to that of the fast Fourier transform. Further, the complexity of clustering using the K-means algorithm is approximately $O(nLKi)$, where $K$ is the number of clusters and $i$ the number of iterations until convergence. Thus, the overall complexity of the proposed approach is lower than that of conventional DL methods, for which the time complexity per iteration scales as $O(n^2 L)$, and in some cases as $O(n^3 L)$ [?].


Algorithm 1 CS-EMD algorithm

Inputs: Compressive signal matrix $Y = [y_1 \dots y_L]$, and sensing matrix $\Phi$
Outputs: Recovered signal matrix $X = [x_1 \dots x_L]$
Initialization: $\Psi = [\,]$, $J$, $\epsilon$, $\beta$ and $K_q \ \forall q$ s.t. $d = \sum_q K_q$

Preprocessing stage
1: Compute $\hat{Y} = [\hat{y}_1 \dots \hat{y}_L]$, using cosine interpolation on $Y$

Dictionary learning stage
for $i = 1$ to $L$
2: Compute $J$ IMFs $m_{q,i}$, $q = 1 \dots J$, from $\hat{y}_i$ using EMD
end for
for $q = 1$ to $J$
3: Collect the $q$-th IMFs $m_{q,i}$, $i = 1 \dots L$, as columns of matrix $M_q$
4: Cluster the columns of matrix $M_q$ into $K_q$ clusters
5: Collect the cluster centroids as columns of matrix $C_{K_q}$
6: Update the dictionary using the cluster centroids as $\Psi = [\Psi \mid C_{K_q}]$
end for

Sparse coding and signal recovery stage
7: Solve $A = \arg\min \|A\|_1$ s.t. $\|Y - \Phi\Psi A\|_F^2 = \|Y - DA\|_F^2 \le \epsilon$
8: Recover the signal matrix as $X \approx \Psi A$
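Steps 7 and 8 can be sketched for a single frame as follows. The sketch swaps the constrained $\ell_1$ problem for its Lagrangian form, $\min_a \tfrac{1}{2}\|y - Da\|_2^2 + \lambda\|a\|_1$, and solves it with the iterative soft-thresholding algorithm (ISTA); this is a substitute solver chosen for brevity, not the solver used in the experiments. The value of $\lambda$, the step size rule, and the random toy matrices are all assumptions.

```python
import random

def matvec(a, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(r * v for r, v in zip(row, x)) for row in a]

def matmul(a, b):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(r * c for r, c in zip(row, col)) for col in cols] for row in a]

def soft(v, t):
    """Soft-thresholding, the proximal operator of t * |v|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def objective(d, y, a, lam):
    """0.5 * ||y - D a||_2^2 + lam * ||a||_1."""
    r = [p - q for p, q in zip(matvec(d, a), y)]
    return 0.5 * sum(c * c for c in r) + lam * sum(abs(c) for c in a)

def ista(d, y, lam=0.01, iters=300):
    """Minimise the Lagrangian l1 objective for one frame with ISTA."""
    dt = [list(col) for col in zip(*d)]
    # 1 / ||D||_F^2 never exceeds 1 / ||D||_2^2, so this step size is safe.
    step = 1.0 / sum(c * c for row in d for c in row)
    a = [0.0] * len(d[0])
    for _ in range(iters):
        r = [p - q for p, q in zip(matvec(d, a), y)]
        g = matvec(dt, r)  # gradient of the quadratic term
        a = [soft(ai - step * gi, step * lam) for ai, gi in zip(a, g)]
    return a

# Toy problem: n = 16 samples, m = 8 measurements, d = 24 atoms.
rng = random.Random(1)
n, m, d = 16, 8, 24
phi = [[rng.gauss(0, 1 / m ** 0.5) for _ in range(n)] for _ in range(m)]
psi = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
a_true = [0.0] * d
a_true[3], a_true[17] = 1.0, -0.5
y = matvec(phi, matvec(psi, a_true))  # y = Phi Psi a
D = matmul(phi, psi)                  # overall dictionary D = Phi Psi
a_hat = ista(D, y)                    # step 7 (Lagrangian form)
x_hat = matvec(psi, a_hat)            # step 8: x ~ Psi a
```

With the conservative step size, the objective decreases monotonically from its value at the all-zero start; a dedicated $\ell_1$ solver would typically converge in far fewer iterations.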


4. EXPERIMENTAL RESULTS 

In each experiment, speech is processed on a short-time frame basis, where framing is achieved by applying a 50 ms long Hanning window with the frame overlap set to 50%. The sensing matrix $\Phi$ is chosen to be a random Gaussian matrix with a compression ratio $m/n = 0.5$ unless otherwise stated. The maximum number of IMFs extracted using EEMD (with $N_e = 50$ noise realizations) for each compressive speech frame is set to 5. A dictionary containing 600 atoms is learned for each speech utterance (sampled at 8 kHz) taken from the KED TIMIT corpus [17]. As the initial IMF levels contribute more towards the overall signal approximation, the numbers of dictionary atoms chosen empirically from each IMF level across all frames after clustering are 140, 140, 110, 110, and 100, respectively. We conducted the experiments on a quad-core Intel i7 machine at 3.5 GHz with 12 GB of RAM, using MATLAB under the Windows 8 operating system. For brevity, we focus on signal recovery, but the proposed dictionary can be readily applied to other speech applications as well.

4.1. Speech recovery from compressive measurements 

In this experiment, we assumed that only compressive measurements of a speech utterance are available at the decoder. We considered multiple speech utterances, and for each one a dictionary is learned using the method presented in Section 3. The learned dictionary is then applied in the CS framework to obtain the sparse representation of each speech frame using $\ell_1$-minimization, solved using the YALL1 package [18]. The speech utterance was then reconstructed using the standard overlap-and-add method.
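The framing and overlap-and-add steps can be sketched as below for the stated setup (50 ms frames at 8 kHz, i.e., 400 samples, with 50% overlap). The sketch assumes a periodic Hann window applied once at analysis, so that overlapping copies sum to one; whether the experiments used exactly this window variant is not stated, and the function names are hypothetical. In the full pipeline each frame would be recovered by sparse coding before being added back.

```python
import math

def hann(n):
    """Periodic Hann window; copies shifted by n/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]

def split_frames(x, frame_len, hop):
    """Cut x into windowed frames that start every `hop` samples."""
    w = hann(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return [[x[s + k] * w[k] for k in range(frame_len)] for s in starts]

def overlap_add(frames, hop):
    """Sum the frames back at their original offsets."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        for k, v in enumerate(frame):
            out[i * hop + k] += v
    return out

fs = 8000                  # sampling rate in Hz
frame_len, hop = 400, 200  # 50 ms frames, 50% overlap
x = [math.sin(2 * math.pi * 220 * k / fs) for k in range(2000)]
frames = split_frames(x, frame_len, hop)
rebuilt = overlap_add(frames, hop)
```

Away from the first and last half-frame, every sample is covered by two windows whose weights add to one, so unmodified frames are rebuilt exactly; any difference in practice comes from the per-frame recovery error.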

Figure 4 shows an example of the original and the reconstructed speech waveforms, with the corresponding spectrograms shown in Figure 5. One can observe that the proposed method is able to recover the speech signal well. However, as observed in Fig. 3(b), the first few extracted IMFs are generally corrupted, and as a result the higher frequency bands of the recovered speech are also distorted. This is also reflected in a lower perceptual evaluation of speech quality (PESQ) score for the speech recovered using the proposed approach, compared to other recovery-based DL methods, as shown in Table 1. However, some reduction in speech quality is acceptable, considering the gain in time complexity achieved by the proposed approach. To illustrate this, Table 1 also shows the average CPU run times to recover a speech utterance of approximately 3 s (including the time for dictionary learning), and the results confirm that the proposed approach is indeed fast compared to existing approaches. Note that for the proposed approach the run time is dominated by the sparse coding stage.

4.1.1. Discussion 

Our experiments show that one can recover a speech signal directly from compressive samples, provided the CS measurements preserve structural properties of the speech signal. The choice of sensing matrix is crucial, and if the sensing matrix is carefully chosen or designed, one can improve the performance of the proposed approach by learning a better dictionary. In fact, compared to random matrices such as Gaussian/Bernoulli matrices, the performance of the proposed approach increases (as shown in Table 1) for efficiently designed matrices such as sparse Gaussian and structurally random matrices.² All such matrices do preserve the envelope but fail to preserve the pitch-related speech variations in the extracted IMFs, and hence they result in poorer recovery compared to other recovery-based methods. Note that our goal is to recover speech signals from CS measurements at a decoder having limited resources, both in terms of storage and computation.

In fact, the extracted IMFs can reveal important properties of speech segments. Hence, the proposed approach is also promising for various inference problems where actual signal recovery is not required and only CS samples (which require less memory) are available, e.g., voiced/nonvoiced speech detection [19]. In such cases, there is no need to know anything about the sensing matrix used to acquire the signal. However, we defer this and any other extensions to future work.

5. SUMMARY 

In this paper, we have proposed a fast, reconstruction-free DL approach for compressive speech signals. We show that it is indeed possible to learn a dictionary using only compressive speech samples, and hence the proposed approach is promising in resource-constrained environments. EMD decompositions of compressive speech samples are used to form the

²We observed only marginal improvement when a sensing matrix other than Gaussian was employed with existing approaches.


Table 1. Comparative analysis of different methods for signal recovery, averaged over 20 utterances and 10 trials.

Method     CS Matrix         DL Iterations   PESQ   Runtime
CS-EMD     Sparse-Gaussian   N.A             2.92   0.83 min
           SRM [20]                          2.91
           Gaussian                          2.90
           Bernoulli [21]                    2.84
Blind CS   Gaussian          20              2.97   5 min
IHT        Gaussian          20              3.10   3 min
Fig. 4. (a) Original speech signal; (b) speech signal recovered from compressed measurements at a compression ratio m/n of 0.5.

Fig. 5. Spectrograms of (a) the original speech signal; (b) and (c) the speech signal recovered from compressed measurements at compression ratios m/n of 0.5 and 0.7, respectively.


atoms of the dictionary, motivated by the fact that CS samples have an envelope similar to that of the original speech samples. Preliminary results on the signal recovery experiment show that the proposed approach can be an alternative to the existing explicit and implicit CS recovery methods. The full potential of this new approach is yet to be realized, and additional work is required to establish the gains. In our future research, we wish to extend this approach to inference problems where actual signal recovery is not required. One possible extension is to incorporate the proposed approach in various speech applications such as voice activity detection, speaker identification, or speech recognition.